About the effects of using Anaphora Resolution in assessing free-text student answers

Diana Pérez (1), Oana Postolache (3), Enrique Alfonseca (1), Dan Cristea (2) and Pilar Rodriguez (1)

(1) Dpt. of Computer Science, U.A.M. (Spain)
(2) Dpt. of Computer Science, U. of Iasi (Romania)
(3) Dpt. of Computational Linguistics, U. of Saarland (Germany)

{Diana.Perez,Enrique.Alfonseca,Pilar.Rodriguez}@uam.es, dcristea@infoiasi.ro and oana@coli.uni-sb.de

This work has been sponsored by the Spanish Ministry of Science and Technology, project number TIN2004-0314.

Abstract

In this paper we present a possibility for integrating Anaphora Resolution (AR) in a system to automatically evaluate students' free-text answers. An initial discussion introduces some of the several methods that can be tried out. The implementation makes use of the AR-Engine RARE (Cristea et al. 02), integrated into the free-text answers assessor Atenea (Alfonseca & Pérez 04), to test these methods. RARE has been applied to find coreferential chains, and it has been found useful to extend the set of reference answers used by Atenea, by automatically generating new correct answers.

1 Introduction

Computer Assisted Assessment (CAA) is a field that studies how a computer can be used to assess students. One of its subfields, which has recently attracted much attention, focuses on assessing free-text answers. It is quite a complex task, still far from being completely solved. Thus, many systems are being developed, relying on various techniques. A classification of these techniques with examples of existing systems that use them is given in (Perez 04):

• Statistical techniques: they are based on some kind of statistical analysis, such as word frequency counts, or Latent Semantic Analysis (LSA) (Landauer et al. 01).
• Text Categorization Techniques (TCT): the student's answer can be classified as right or wrong, or inside a category in a scale of grades (e.g. bad, intermediate, good, very good). TCT, e.g. Bayesian networks, can be applied in this case (Larkey 98).
• Information Extraction techniques: they are used by systems which acquire structured information from free text, for example dependencies between concepts as in Automark (Mitchell et al. 02).
• Full Natural Language Processing (NLP): NLP techniques, such as parsing and rhetorical analysis, can be used to gather more information about the student's answer. A system that applies NLP techniques is C-rater (Burstein et al. 01).
• Clustering: grouping essays that have similar word patterns to form a cluster with the same score. This is the approach followed by the Intelligent Essay Marking System (Ming et al. 00).
• Hybrid approaches: they combine several techniques to achieve better results. For instance, E-rater (Burstein et al. 98) and Atenea (Alfonseca & Pérez 04) use statistical and NLP techniques.

Although the techniques may seem very different, the general idea that underpins all these systems is the same: to compare the student's answer (or candidate answer) with the teacher's ideal answer (or reference answer). The closer they are, the higher the student's score is.

One obstacle to comparing the results of all these systems with each other is that, currently, there are no standard evaluation corpora and metrics. Concerning the evaluation metrics, there is a trend of using the Pearson correlation between the teachers' and the system's scores on the same data set (Valenti et al. 03; Perez 04). The state-of-the-art results are between 30% and 93%, because the corpora used have very different degrees of difficulty.

Among the NLP techniques that can be employed to improve the automatic assessment of open-ended questions, Anaphora Resolution (AR), the process of finding the antecedent of an anaphor, could be considered as well. This language phenomenon, consisting of referring to a previously mentioned entity, is quite common in written language (Vicedo & Ferrández 00). Moreover, it has successfully been applied to other fields (Cristea et al. 05). Previous authors have also mentioned
that AR will probably be useful for free-text CAA (Valenti et al. 03). However, to our knowledge, there are still no studies indicating the impact of applying AR to automatic assessment of free-text answers.

Therefore, the main motivation of this paper is to study the effects of using AR integrated with the Atenea system. The AR-engine chosen is RARE (Cristea et al. 02). Our initial hypothesis was that it would somehow improve the accuracy of the assessment.

The first step to accomplish our aim has been to decide the way in which AR will be integrated with Atenea. The experimental framework given by the integration of RARE in Atenea has made it possible to try several different uses of AR for free-text CAA. The appropriateness of the procedure has been measured with the Pearson correlation between the teachers' and the system's scores. The results show that the application of AR directly on the student's answers does not improve the results in our case. On the other hand, AR has been found useful for automatically generating many alternative references and, in this way, it slightly increases Atenea's assessment accuracy.

The paper is organized as follows: Section 2 presents the description of the possible uses that AR has in CAA of free-text answers. In Section 3, the implementation used in the experiments to test the previously mentioned methods is shown. Finally, in Section 4 several conclusions are drawn and future work is outlined.

2 Possible uses of AR in free-text CAA

Most of the systems for evaluating open-ended questions compare the student's candidate answer with reference answers written by the teachers. Therefore, the system will not be able to evaluate an answer correctly if the word choice or the expression used by the student and the teacher are different. We can try to solve this problem on both sides:

• Reducing the possible paraphrasings of each text, for instance, by eliminating all the pronouns and some definite NPs, using Anaphora Resolution.
• Extending the set of references with alternative paraphrasings. This can be done manually by asking several teachers to write alternative answers for the same question, or automatically, for instance, expanding the text with synonyms of the words used, or using AR, as described below.

Concerning the reduction of paraphrasing in a text, it is well known that there are many different expressions that have the same meaning. One of the sources of paraphrasing stems from the fact that there are many ways to refer to a previously mentioned entity by using an anaphoric expression. AR could help by identifying the referential expressions (REs) for the same referents, and gathering them in coreferential chains. Once coreferential chains are found, we have designed three ways in which they can be used:

1. First-NP: Each NP in the candidate and in the reference answers is replaced by the first NP in the coreferential chain. The aim is to filter the paraphrasing by replacing all NPs which refer to the same concept with the first NP used.
   For instance, let us suppose that we are scoring the candidate answer
   (1) Unix is an operating system. It is multiuser.
   and we apply this method to help in the comparison between this text and the references. The AR-engine RARE says that Unix, operating system and It are coreferential REs. Therefore, all of them will be replaced by the first RE (Unix), and the answer will be transformed into
   (2) Unix is Unix. Unix is multiuser.
   Note that RARE considers the relationship between the subject and the predicative noun as coreferential, as indicated in the MUC annotation guidelines (Hirschman et al. 97).

2. All-NPs: Each NP in the candidate and the reference answers is replaced by the whole coreferential chain to which it belongs. In this way, the candidate and reference answers will match if the intersection between the coreferential chains, considered as sets, is not empty. The third person singular personal pronoun it is excluded from these chains because most of the coreferential chains contain it.
   Thus, the candidate answer (1) will be transformed into
   (3) {an operating system, Unix} is {an operating system, Unix}. {an operating system, Unix} is multiuser.

3. Only-it: Only the it pronouns in the candidate and the reference answers are replaced by the first NP in the coreferential chain which is not an it. This has been considered relevant enough to be studied given the extremely high frequency of this pronoun in the student answers in our test sets. This technique will also avoid the problem mentioned before with the predicative NPs.
   Thus, the resulting candidate answer for (1) would be
   (4) Unix is an operating system. Unix is multiuser.

Concerning the creation of new reference answers with alternative paraphrasings, we have also considered the possibility of applying AR in this task. While in the previous methods AR was applied to both the candidate and the reference answers, in this method it only affects the reference answers. The motivation is that the quality of the references is crucial, since they are the texts to which the students' answers are compared. Therefore, the usual practice for getting new references is to ask teachers to write them.

However, as this is very costly and time consuming, we have also considered the automatic generation of new reference answers. It can be done by automatically replacing the NPs in the coreferential chains with other referential expressions for those NPs. For instance, if we consider that (1) is a reference written by a teacher, two new references can be generated from its coreferential chain [Unix, an operating system, it]:

(5) Unix is an operating system. Unix is multiuser.
    Unix is an operating system. An operating system is multiuser.

3 Implementation

3.1 Atenea

Atenea (Alfonseca & Pérez 04) is a CAA system for automatically scoring students' short answers. It has already been tested with English and Spanish texts and it could be easily ported to other languages. It works by processing the student's and teacher's answers according to several or all of the following NLP techniques, using the wraetlic tools (Alfonseca 03)[1]:

• Stemming: To be able to match inflected nouns or verbs.
• Removal of closed-class words: To be able to ignore them.
• Word Sense Disambiguation: To identify the sense intended by both the teacher and the student.

Then, the processed answers enter the comparison module (ERB), which calculates the student's score and generates the student's feedback. This module is based on a modification of the n-gram co-occurrence scoring algorithm Bleu (Papineni et al. 01). The modification is necessary to take into account not only the precision but also the recall (Alfonseca & Pérez 04). The pseudocode of ERB is as follows:

1. For each value of N (typically from 1 to 3), calculate the Modified Unified Precision (MUP_N) as the percentage of N-grams from the candidate answer that appear in any of the reference texts. The count of each N-gram is clipped by the maximum frequency with which it appears in any of the references.
2. Calculate the weighted linear average of MUP_N obtained for each value of N. Store it in combMUP.
3. Calculate the Modified Brevity Penalty (MBP) factor, which is intended to penalize answers with a very high precision but which are too short, in order to measure the recall:
   (a) For N from a maximum value (e.g. 10) down to 1, look whether each N-gram from the candidate text appears in any reference. In that case, mark the words from the found N-gram, both in the candidate and in the reference.
   (b) For each reference text, count the number of words that are marked, and calculate the percentage of the reference that has been found in the student's answer.
   (c) The MBP factor is the sum of all those percentage values.
4. The final score is the result of multiplying the MBP factor by e^combMUP.
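The four ERB steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Atenea's actual code: the function names (`ngrams`, `mup`, `covered_fraction`, `erb_score`) are hypothetical, answers are assumed to be already tokenized into lists of words, the weighted average of step 2 is taken with uniform weights, and step 3(a) is read as a greedy longest-first alignment computed separately for each reference.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mup(candidate, references, n):
    """Step 1: clipped precision of the candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    matched = 0
    for gram, count in cand_counts.items():
        # Clip by the maximum frequency of the n-gram in any single reference.
        max_in_refs = max((rc[gram] for rc in ref_counts), default=0)
        matched += min(count, max_in_refs)
    return matched / total


def covered_fraction(candidate, reference, max_n):
    """Steps 3(a)-(b): mark shared n-grams, longest first (assumed reading)."""
    if not reference:
        return 0.0
    cand_marked = [False] * len(candidate)
    ref_marked = [False] * len(reference)
    for n in range(max_n, 0, -1):
        for i in range(len(candidate) - n + 1):
            if any(cand_marked[i:i + n]):
                continue  # these candidate words were already matched
            for j in range(len(reference) - n + 1):
                if any(ref_marked[j:j + n]):
                    continue
                if candidate[i:i + n] == reference[j:j + n]:
                    for k in range(n):
                        cand_marked[i + k] = True
                        ref_marked[j + k] = True
                    break
    return sum(ref_marked) / len(reference)


def erb_score(candidate, references, max_precision_n=3, max_recall_n=10):
    """Steps 2-4: combine the precision and recall components."""
    mups = [mup(candidate, references, n)
            for n in range(1, max_precision_n + 1)]
    comb_mup = sum(mups) / len(mups)  # uniform weights: an assumption
    mbp = sum(covered_fraction(candidate, ref, max_recall_n)
              for ref in references)
    return mbp * math.exp(comb_mup)
```

With a single reference and a candidate identical to it (three words or longer), every MUP_N equals 1 and the whole reference is covered, so the sketch returns 1 × e^1. A candidate sharing no word with the reference scores 0.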
The answer will be returned to the student, together with a score and feedback based on a color code, in which the parts of the student's answer which appear in the references are marked with a darker background (see Figure 1).

[1] Available at www.ii.uam.es/~ealfon/eng/download.html

Figure 1: Feedback for the student, and score

3.2 RARE

RARE (Robust Anaphora Resolution Engine) allows the design, implementation and evaluation of different multilingual anaphora resolution models on free texts. The engine (Cristea et al. 02; Postolache & Forascu 04) has successfully been integrated into a discourse parser (Cristea et al. 05) and a time tracking approach (Puscasu 04). It allows postponed resolution and deals with several varieties of anaphora, from purely pronominal anaphora to more complex types such as bridging anaphora. The information is organized in RARE on three layers:

1. The text layer: It is composed of the words that form the discourse and it is populated with the referential expressions (REs). For example, in the candidate answer "Unix is an operating system. It is multiuser", "Unix", "operating system" and "it" are the REs.
2. The projection layer: This layer stores information about the found REs in feature structures called projection structures (PSs), to help in determining which ones are coreferential.
3. The semantic layer: The REs represent entities from the real world. The underlying meaning of the REs is treated in the semantic layer in the form of Discourse Entities (DEs).

It is said that a PS is projected from an RE and a DE is proposed or evoked by a PS. The process should be done from left to right in languages that are read in that way, and from right to left in languages read in that direction. Irrespective of the language, the necessary features for any AR model to be used in RARE are (Cristea & Dima 01):

• A set of primary attributes: indicating, for example, morphological, syntactic, semantic or positional information.
• A set of knowledge resources: such as a part-of-speech tagger and an NP extractor, to fill in the primary attributes to be stored in the PSs.
• A set of heuristics or rules: for each RE they decide if it refers to a new DE or to an already existing one.
• A domain of referentiality: it says where, how many and the order in which the DEs have to be checked.

Figure 2: RARE layers

The phases in the processing done by RARE are as follows (see Figure 2):

1. A referential expression RE_a is projected from the text layer into a feature structure PS_a on the projection layer. At this moment, the engine searches the space of existing discourse entities in order to recognize the one that the newly projected structure matches best.
2. If no such DE is found, the projected structure PS_a is transformed into a new discourse entity DE_a, on the semantic layer, and discarded from the projection layer. As the text unfolds, a new referential expression RE_b can be found on the text layer and, in its turn, projected as PS_b.
3. PS_b may match an already existing discourse entity DE_a, meaning that their respective referential expressions, RE_a and RE_b, are coreferential. If this happens, PS_b is combined with DE_a and, subsequently, is discarded from the projection layer.
4. Finally, chains of coreferential expressions are linked to the same object of the semantic layer, signifying that a unique discourse entity is evoked by all REs of the chain.

3.3 Techniques to use RARE in Atenea

The use of RARE as a new NLP module in Atenea requires the introduction of a new preliminary phase to perform the pre-processing necessary for RARE. This phase includes a Functional Dependency Grammar (FDG) parsing of the text and the transformation of its result into an intermediate format understandable by RARE and Atenea. This format is a table in which each row represents a chain and, for each row, there are as many cells as there are NPs in the chain. For the example candidate text (1) from Section 2, the equivalence table would have just one row (as it only has one chain) and it would be: [Unix, an operating system, it]

The next step varies according to the method chosen:

• First-NP: each NP found in a row of the equivalence table is replaced by the first NP which is not an "it" in the chain.
• All-NPs: each NP found in a row of the equivalence table is replaced by the whole chain as a set.
• Only-it: each non-pleonastic "it" found in a row of the equivalence table is replaced by the first NP which is not an "it" in the coreferential chain.

Secondly, to implement the procedure for automatically generating new paraphrases of the reference texts, the following pseudocode has been used. It starts with one reference text that has been written by hand by a teacher.

1. Initialize the array genRefTexts with the reference text.
2. Look for the next non-pleonastic "it". If none is found, stop.
3. Identify the row of the table that contains the coreferential chain which includes the "it" pronoun found.
4. Create as many copies of all the references in genRefTexts as NPs exist in the coreferential chain. In each of the copies, the last "it" found is replaced by a different possible RE.
5. Go back to the second step.

Figure 3 shows an execution example.

Figure 3: Example of the generation of new references from the original text "Unix is an operating system. It is multi-user. It is easy to use".

3.4 Evaluation

For evaluation purposes, we have used a corpus composed of four sets of answers written by Spanish students in real exams about Operating Systems. In previous work, we have observed that the variation in accuracy of Atenea is not statistically significant when both the candidate and the reference texts are translated with an MT tool (Pérez et al. 05). Given that RARE is currently available just for English, the corpus was automatically translated to English using Altavista Babelfish[2]. Besides, a set of definitions of Operating System, retrieved from Google glossary, has also been added as a fifth test set. The five data sets are described in Table 1. The FDG-parsing of these data sets was done with the on-line demo of Connexor[3].

[2] http://world.altavista.com/
[3] http://www.connexor.com/

SET   NC    MC    NR   MR    Type
1     79    51    3    42    Def
2     143   48    7    27    A/D
3     295   56    8    55    A/D
4     117   127   5    71    Y/N
5     38    67    4    130   Def

Table 1: Answer sets used in the evaluation. Columns indicate: set number; number of candidate texts (NC); mean length of the candidate texts (MC); number of references (NR); mean length of the references (MR); and the type of question (Def = definitions; A/D = advantages/disadvantages; Y/N = justified Yes/No). The length is measured in number of words.

Reduction of paraphrasing. The first experiment explores the impact of the reduction of paraphrasing both in the candidate answers and the references. The correlation between the teachers' and the system's scores has been calculated using the different settings of the system.

Table 2 shows the correlation values of Atenea without using any NLP modules other than AR. The second column indicates the correlation obtained for each dataset with just ERB, and the next three columns contain the results for each of the three experiments. The figure marked with an asterisk indicates the case in which using RARE has improved the result over the original ERB.

Set    ERB      First-NP   All-NPs   Only-It
1      0.5323   0.5217     0.2506    0.5176
2      0.6442   0.5984     0.5107    0.6337
3      0.2201   0.1731     0.0209    0.1529
4      0.3121   0.2102     0.1878    0.2222
5      0.5868   0.5799     0.0239    0.5941*
Mean   0.4591   0.4167     0.1968    0.4241

Table 2: Results achieved using Atenea only with RARE (without any other NLP module) to reduce the paraphrasing.

Contrary to our intuition, the results show that there is no significant improvement in using RARE and, in some cases, such as in the all-NPs method, the correlations decrease for all data sets. Therefore, our conclusion is that AR is not useful to improve the results of n-gram co-occurrence similarity metrics.

Creation of new references. RARE has also been used to create new references by substituting the non-pleonastic it pronouns with all their Referential Expressions. Tables 3 and 4 show the correlation values for the five evaluation data sets using the old sets of references and the extended sets of references, respectively. The different columns indicate the configuration of Atenea that has been used: no NLP processing (ERB), stemming (S), removal of closed-class words (C), stemming and removal of closed-class words (S+C), Word Sense Disambiguation (W) and Word Sense Disambiguation with removal of closed-class words (W+C).

N      ERB      S        C        S+C      W        W+C
1      0.5323   0.4337   0.5479   0.5310   0.4176   0.4841
2      0.6442   0.6899   0.6066   0.7567   0.6998   0.7655
3      0.2201   0.2426   0.3213   0.3459   0.2358   0.3282
4      0.3121   0.3326   0.3450   0.3754   0.3150   0.3586
5      0.5868   0.6007   0.5663   0.5702   0.6194   0.5919
Mean   0.4591   0.4599   0.4774   0.5158   0.4575   0.5057

Table 3: Results achieved using Atenea without RARE

N      Generated refs   ERB      S        C        S+C      W        W+C
1      3                0.5212   0.4688   0.5824   0.5501   0.4405   0.4951
2      8                0.6442   0.6355   0.6667   0.7094   0.6537   0.7199
3      17               0.2218   0.2370   0.3083   0.3390   0.2255   0.3238
4      13               0.2918   0.2853   0.3806   0.4233   0.2745   0.4182
5      36               0.5964   0.6141   0.5607   0.5903   0.6208   0.6054
Mean   15.4             0.4551   0.4481   0.4997   0.5224   0.4430   0.5125

Table 4: Results achieved using Atenea with RARE

It can be seen that the use of RARE has improved three of the five configurations under test (C, S+C and W+C). In fact, the best combination is the use of stemming, closed-class word removal and the RARE-generated references.

4 Conclusions and future work

In this paper, Anaphora Resolution has been applied to the task of automatically assessing students' free-text answers. In particular, the AR-engine RARE has been integrated into Atenea, to test four proposed methods: first-NP, in which the NPs are replaced by the first RE which is not the "it" pronoun; all-NPs, in which the NPs in the candidate and reference texts are replaced by the whole coreferential chain; only-it, in which only the "it" pronouns are replaced by the first RE; and the automatic generation of variable references from the original reference text, to automatically obtain new variants by replacing each non-pleonastic "it" with all the possible NPs in its coreferential chain.
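The four methods summarized above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: coreference resolution is taken as already done (RARE itself is not called), and the data representation is hypothetical. An answer is a list of segments, where a plain string is ordinary text and a (surface form, chain index) pair is a referential expression belonging to the chain with that index. A pleonastic "it" is assumed to be left as plain text, every chain is assumed to contain at least one NP other than "it", and capitalization is not adjusted after a replacement.

```python
from itertools import product


def is_it(surface):
    """True when the referential expression is the pronoun 'it'."""
    return surface.lower() == "it"


def non_it_nps(chain):
    """NPs of a coreferential chain, excluding the pronoun 'it'."""
    return [np for np in chain if not is_it(np)]


def substitute(answer, chains, method):
    """Apply First-NP, All-NPs or Only-it to a segmented answer."""
    out = []
    for seg in answer:
        if isinstance(seg, str):  # plain text is copied unchanged
            out.append(seg)
            continue
        surface, chain_id = seg
        nps = non_it_nps(chains[chain_id])
        if method == "first-np":
            out.append(nps[0])
        elif method == "all-nps":
            out.append("{" + ", ".join(sorted(nps)) + "}")
        elif method == "only-it":
            out.append(nps[0] if is_it(surface) else surface)
        else:
            raise ValueError(f"unknown method: {method}")
    return "".join(out)


def generate_references(answer, chains):
    """Replace each 'it' mention by every non-'it' NP of its chain."""
    options = []
    for seg in answer:
        if isinstance(seg, str):
            options.append([seg])
        else:
            surface, chain_id = seg
            if is_it(surface):
                options.append(non_it_nps(chains[chain_id]))
            else:
                options.append([surface])
    # One new reference per combination of replacements.
    return ["".join(combo) for combo in product(*options)]


# Example (1) from Section 2 with its single coreferential chain.
chains = [["Unix", "an operating system", "it"]]
answer = [("Unix", 0), " is ", ("an operating system", 0), ". ",
          ("It", 0), " is multiuser."]
```

On this example, `substitute(answer, chains, "first-np")` yields the text of (2), `"only-it"` yields the text of (4), and `generate_references(answer, chains)` yields the two references of (5), the second one with a lower-case "an" because capitalization is not handled.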
The results show that, although AR has successfully been used in several fields, the additional pre-processing necessary to run RARE is worthless when combined with the statistical procedure on which Atenea is based. This may be due to several reasons. On the one hand, the number of times that the candidate and the reference match may be artificially inflated when the referential NPs are substituted by their REs. This is especially evident in the all-NPs experiment. On the other hand, we believe that there has not been much improvement because of the characteristics of the n-gram co-occurrence metric used. For instance, let us consider the following example sentences:

(6) a. Unix is an Operating System. It is easy to use.
    b. Unix is easy. It is an Operating System.

Even though the pronoun and its REs are used in different places in the first one and the second one, if we consider the first one as the reference and the second one as the candidate, we can see that all of the n-grams in the candidate appear somewhere in the reference text.

Therefore, this does not imply that AR is not useful in CAA in general, but that it should not be used with BLEU-like algorithms.

Concerning the generation of new references, the results are slightly better, and the average correlation increases up to 52%. Furthermore, this method opens a promising line of future work that could be further exploited to automatically generate new references (for instance, with synonyms of the words in the references). Other lines of future work are the following: to improve the AR model with features specific to the types of answers to be processed, to finish the development of the Spanish anaphora resolution model for RARE, and to test more possibilities for using RARE with Atenea.

References

(Alfonseca & Pérez 04) E. Alfonseca and D. Pérez. Automatic assessment of short questions with a Bleu-inspired algorithm and shallow NLP. In Advances in Natural Language Processing, volume 3230 of Lecture Notes in Computer Science, pages 25–35. Springer Verlag, 2004.

(Alfonseca 03) E. Alfonseca. Wraetlic user guide version 1.0, 2003.

(Burstein et al. 98) J. Burstein, K. Kukich, S. Wolff, C. Lu, M. Chodorow, L. Bradenharder, and M. Dee Harris. Automated scoring using a hybrid feature identification technique. In Proceedings of the Annual Meeting of the Association of Computational Linguistics, 1998.

(Burstein et al. 01) J. Burstein, C. Leacock, and R. Swartz. Automated evaluation of essays and short answers. In Proceedings of the International CAA Conference, 2001.

(Cristea & Dima 01) D. Cristea and G. E. Dima. An integrating framework for anaphora resolution. Information Science and Technology, 4(3), 2001.

(Cristea et al. 02) D. Cristea, O. Postolache, G. E. Dima, and C. Barbu. AR-Engine - a framework for unrestricted co-reference resolution. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), 2002.

(Cristea et al. 05) D. Cristea, O. Postolache, and I. Pistol. Summarisation through discourse parsing. In Proceedings of CICLING 2005, 2005.

(Hirschman et al. 97) Hirschman, Lynette, and Chinchor. MUC-7 coreference task definition, version 3.0. In MUC-7 Proceedings, 1997. See also: http://www.muc.saic.co

(Landauer et al. 01) T. K. Landauer, D. Laham, and P. W. Foltz. The intelligent essay assessor: putting knowledge to the test. In Proceedings of the Association of Test Publishers Computer-Based Testing: Emerging Technologies and Opportunities for Diverse Applications conference, 2001.

(Larkey 98) L. S. Larkey. Automatic essay grading using text categorization techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 90–95, 1998.

(Ming et al. 00) Y. Ming, A. Mikhailov, and T. L. Kuan. Intelligent essay marking system. Learners Together, 2000.

(Mitchell et al. 02) T. Mitchell, T. Russell, P. Broomhead, and N. Aldridge. Towards robust computerised marking of free-text responses, 2002.

(Papineni et al. 01) K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. Research report, IBM, 2001.

(Perez 04) D. Perez. Automatic evaluation of users' short essays by using statistical and shallow natural language processing techniques. Advanced Studies Diploma (Escuela Politécnica Superior, Universidad Autónoma de Madrid), 2004.

(Pérez et al. 05) D. Pérez, E. Alfonseca, and P. Rodríguez. Adapting the automatic assessment of free-text answers to the students profiles. In Proceedings of the CAA conference, Loughborough, U.K., 2005.

(Postolache & Forascu 04) O. Postolache and C. Forascu. A coreference model on excerpt from a novel. In Proceeding of The European Summer School in Logic Language and Information - ESSLLI'2004, Nancy, France, 2004.

(Puscasu 04) G. Puscasu. A framework for temporal resolution. In Proceedings of the Language Resources and Evaluation Conference (LREC-2004), 2004.

(Valenti et al. 03) S. Valenti, F. Neri, and A. Cucchiarelli. An overview of current research on automated essay grading. Journal of Information Technology Education, 2:319–330, 2003.

(Vicedo & Ferrández 00) J. L. Vicedo and A. Ferrández. Importance of pronominal anaphora resolution to question answering systems. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), pages 555–562, 2000.